As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need to re-iterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an interactive-speed query engine for ad-hoc queries on temporary data. DiNoDB avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It is tailored to modern workflows found in machine learning and data exploration use cases, which often involve iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB achieves very good performance for a wide range of ad-hoc queries compared to alternatives such as Hive, Stado, SparkSQL and Impala.
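The abstract does not specify which metadata is created during the batch phase. Purely as an illustration of the piggybacking idea, the following minimal sketch assumes the metadata is a list of per-row byte offsets recorded while the batch job writes its output, so that a later interactive lookup can seek directly to a row instead of rescanning the file. The function names `batch_write_with_offsets` and `fetch_row` are hypothetical and are not part of DiNoDB.

```python
import io

def batch_write_with_offsets(rows, out):
    """Batch phase (hypothetical): write rows as CSV lines to a binary stream
    and, in the same pass, record the byte offset where each row starts."""
    offsets = []
    for row in rows:
        offsets.append(out.tell())  # metadata collected as a side effect of writing
        out.write((",".join(str(v) for v in row) + "\n").encode("utf-8"))
    return offsets

def fetch_row(out, offsets, i):
    """Interactive phase (hypothetical): jump straight to row i using the
    recorded offsets rather than scanning the output from the beginning."""
    out.seek(offsets[i])
    return out.readline().decode("utf-8").rstrip("\n").split(",")

# Toy data standing in for the output of a batch job.
rows = [(1, "a", 0.5), (2, "b", 1.5), (3, "c", 2.5)]
buf = io.BytesIO()
offsets = batch_write_with_offsets(rows, buf)
third = fetch_row(buf, offsets, 2)
```

In this sketch, building the offsets requires no separate pass over the data, which is the property the abstract attributes to metadata generation during batch processing; a loading-based system would instead have to read and transform the whole output before answering the first query.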